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Abstract 

Residue-residue interactions that fold a protein into a unique three-dimensional structure and make 
it play a specific function impose structural and functional constraints on each residue site. Selective 
constraints on residue sites are recorded in amino acid orders in homologous sequences and also in the 
evolutionary trace of amino acid substitutions. A challenge is to extract direct dependences between 
residue sites by removing indirect dependences through other residues within a protein or even through 
other molecules. Recent attempts of disentangling direct from indirect dependences of amino acid types 
between residue positions in multiple sequence alignments have revealed that the strength of inferred 
residue pair couplings is an excellent predictor of residue-residue proximity in folded structures. Here, 
we report an alternative attempt of inferring co-evolving site pairs from concurrent and compensatory 
substitutions between sites in each branch of a phylogenetic tree. First, branch lengths of a phylogenetic 
tree inferred by the neighbor-joining method are optimized as well as other parameters by maximizing 
a likelihood of the tree in a mechanistic codon substitution model. Mean changes of quantities, which 
are characteristic of concurrent and compensatory substitutions, accompanied by substitutions at each 
site in each branch of the tree are estimated with the likelihood of each substitution. Partial correlation 
coefficients of the characteristic changes along branches between sites, in which linear dependences on 
other sites are removed, are calculated and used to rank co-evolving site pairs. Accuracy of contact 
prediction based on the present co-evolution score is comparable to that achieved by a maximum entropy 
model of protein sequences for 15 protein families taken from the Pfam release 26.0. Besides, this excellent 
accuracy indicates that compensatory substitutions are significant in protein evolution. 

Introduction 

The evolutionary history of protein sequences is a valuable source of information in many fields of science 
not only in evolutionary biology but even to understand protein structures. Residue-residue interactions 
that fold a protein into a unique three-dimensional (3D) structure and make it play a specific function 
impose structural and functional constraints on each amino acid. Selective constrains on amino acids are 
recorded in amino acid orders in homologous protein sequences and also in the evolutionary trace of amino 
acid substitutions. Negative effects caused by mutations at one site must be compensated by successive 
mutations at other sites [THl], otherwise negative mutants will be eliminated from a gene pool and 
never reach fixation in a population. Such structural and functional constraints arise from interactions 
between sites mostly in close spatial proximity. Thus, it is suggested that the types of amino acids 
and amino acid substitutions must be correlated between sites that are close in a protein 3D structure 
[5"Ml8j. However, correlations of amino acid types and amino acid substitutions do not only reflect direct 
interactions between sites but also indirect ones through other residues within a protein or even through 
other molecules involved in a molecular complex |19U20] such as oligomerization |16j . protein-substrate, 
protein-protein |18| , and protein-DNA. In addition, statistical noise due to the small number of samples 
and methodological limitations are obstacles to decode correlations into spatial relationships between 
sites. However, protein families consisting of homologous sequences in a wide range of divergence are now 
collected in protein family databases such as Pfam [21] , and become available to reduce statistical noise to 
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a sufficiently small amount. A present challenge is thus to extract only direct dependences between sites by 
excluding indirect correlations between them from a wide variety of homologous sequences evolutionarily 
exploited in a sequence space [22 27] . 

Extracting essential information from the evolutionary sequence record have been attempted using 
global statistical models. A Bayesian graphical model was applied to disentangling direct from indirect 
dependencies between residue positions in multiple sequence alignments of proteins [23] , and a significant 
improvement was achieved in the accuracy of contact prediction [24]. A Bayesian graphical model was also 
applied to the analysis of the joint distribution of substitution events to identify significant associations 
among residue sites [22]. Recently, remarkable accuracy of contact prediction was achieved [25][2B] by 
using a maximum entropy model [18] of a protein sequence, constrained by the statistics of a multiple 
sequence alignment, to infer residue pair coupling. Partial correlation coefficients derived from mutual 
information of residue pair coupling were also used to extract direct information |27j . They developed 
not only a robust method to extract essential correlations of amino acid type between residue positions in 
multiple sequence alignments, but also showed that inferred residue-residue proximities provide sufficient 
information to predict a protein fold without the use of known three-dimensional structures. 

Here, we report an alternative approach of inferring co-evolving site pairs from concurrent and com- 
pensatory substitutions between sites in each branch of a phylogcnetic tree. First, branch lengths of 
a phylogcnetic tree inferred by the neighbor-joining (NJ) method [28j are optimized by maximizing a 
likelihood of the tree in a mechanistic codon substitution model [2j|J[30] ■ The variation of selective con- 
straints over sites is approximated by a discrete gamma distribution [31j . Mean changes of the various 
types of quantities, which are characteristic of concurrent and compensatory substitutions, accompanied 
by substitutions at each site in each branch of the tree are estimated on the basis of likelihoods. Par- 
tial correlation coefficients of their characteristic changes accompanied by substitutions along branches 
between sites are employed to remove a linear dependence on characteristic changes along branches at 
other sites. In other words, a Gaussian graphical model [32] is assumed for site dependence, because a 
conditional independence between two variables given other variables in a multi-variate Gaussian distri- 
bution is equivalent to zero partial correlation coefficient between the two variables. Then, co-evolution 
scores arc defined from partial correlation coefficients of the various types of the characteristic quantities. 
Co-evolving site pairs are inferred in the decreasing order of the overall co-evolution score. Accuracy 
of contact prediction based on the partial correlation coefficients is comparable to that by a maximum 
entropy model 261 for 15 protein families taken from the Pfam release 26.0 [21], indicating that the 
present method can be an alternative approach for contact prediction. Also, a fact that contact site 
pairs can be well predicted by the present method strongly indicates that compensatory substitutions 
are significant in protein evolution, because the present method based on concurrent and compensatory 
substitutions will not work at all if all substitutions are completely neutral. 

Results 



Framework of the present method 

The framework of the present method is shown in Fig. [T] For each protein family, its phylogenetic 
tree T inferred by the neighbor-joining (NJ) method is taken from the Pfam database [21] and branch 
lengths if, of the tree are optimized by maximizing the likelihood of the tree in a mechanistic codon 
substitution model. Then, the average changes (Aj&) of quantities, which are characteristic of concurrent 
and compensatory substitutions, accompanied by substitutions at each site i in each branch b of the 
phylogcnetic tree T are estimated with the likelihood of each substitution. Their correlation coefficients 
Ai Aj ) along branches between sites are calculated, and converted to partial correlation coefficients, 
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which are correlation coefficients between residual vectors (IIx{A fc = Si arL d H±{A k ^ i • °f gi ven 

two vectors that are perpendicular to a subspacc consisting of other vectors except those two vectors 
(Aj and Aj) and therefore cannot be accounted for by a linear regression on other vectors. Finally, 
co-evolution scores based on the partial correlation coefficients are defined and used to rank site pairs for 
close spatial proximity. 

The following characteristic changes defined in the Methods section are examined to detect concurrent 
and compensatory substitutions between sites; (1) occurrence of amino acid substitution: A* 1 , (2) side- 
chain volume: A", (3) side-chain charge: Af, (4) hydrogen-bonding capability: A^ h , (5) hydrophobicity: 
A' 1 , (6) a propensity: A", (7) (3 propensity: Af , (8) turn propensity: A*, (9) aromatic interaction: 
A- r , (10) branched side-chain: A- r , (11) cross-link capability: A.f, and (12) ionic side-chain: A- on . All 
except the a propensity arc used to define co-evolution scores. 

Protein families and sequences used 

In order to calculate partial correlation coefficients between sites by taking the inverse of a covariance or 
correlation matrix, it must be regular so that the number of homologous sequences must be at least more 
than the number of sites, that is, the sequence length. To obtain statistically reliable numbers, even more 
sequences than 10 times as many as sites would be needed. In the Pfam database [21], protein domain 
families consisting of many thousands of homologous sequences are included. Particularly, protein domain 
families used in |26j to infer residue pair couplings in multiple sequence alignments are appropriate to 
allow us to compare prediction accuracies between the present method and their method. These protein 
domain families in the Pfam release 26.0 (November 2011) are listed in Table[T] Also, Table [T] shows the 
Uniprot ID and corresponding protein coordinates (PDB ID) of a target protein in each protein family, 
for which co-evolving site pairs are predicted. 

In the Pfam database, there are two sets of sequence alignments for each protein family; a seed 
alignment and a full alignment. Also, a phylogcnctic tree inferred from each alignment by the NJ method 
[28j are available. Here the seed alignment and its phylogcnctic tree are used to estimate parameters in a 
mechanistic codon substitution model [29[[30]. With those parameters optimized for the seed alignment 
in the codon-based model, posterior means of characteristic variables at each site in each branch of a 
phylogcnctic tree are estimated for subsets of a full alignment, after branch lengths are optimized. 

The full alignments include closely-related sequences whose differences are less than 0.01. The num- 
ber of branches (n&) in a phylogenetic tree is proportional to the number of OTUs (n otu ) (operational 
taxonomic units that correspond to sequences in the present case); rib = 2n otu — 3 for an unrooted tree. 
Computational time required for the present calculation increases with increasing number of branches. 
Including closely-related sequences requires computationally intensive calculation, although it is not much 
informative; invariant sites do not have any information in the present method, which is designed to detect 
concurrent and compensatory substitutions between sites in proteins. Thus, subsets made by excluding 
closely- related sequences from the Pfam full alignments are used in the present calculations. The subsets 
of a full alignment and their NJ trees are made by removing OTUs that are connected to the parent 
nodes with branches shorter than a certain threshold (Tfct), although seed sequences and a target protein 
are not removed. 

In addition, to reduce a computational load in the calculation of the likelihood of a phylogenetic tree, 
only site positions where amino acids are found in the target protein are extracted from the multiple 
sequence alignment and used in the present analysis. 
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Correlation versus partial correlation coefficients 

First, we examined how differently correlation coefficients and partial correlation coefficients between 
sites identify dependent site pairs. The distribution of Pearson's correlation coefficient in the case of 
no correlation can be well approximated by the Student's t distribution. Therefore, here a correlation 
coefficient r t corresponding to the E- value E t = 0.001 (the P-value Pt = E t /n pa i IS ) in the Student's 
t-distribution of the degree of freedom df = n& — 2 is used as a threshold for significance; where n pa i rs is 
the number of site pairs and rib = 2n otu — 3 is the number of branches in a unrooted phylogcnetic tree. 

In Tablc[2j correlation coefficients (rAjA») and partial correlation coefficients (rrj ± a s h ± a s ) of substi- 
tution probabilities along branches between sites are classified into four categories; significantly positive, 
positive but insignificant, negative but insignificant, significantly negative. In addition, sites pairs in 
each category are classified according to whether they are contact residue pairs in the protein 3D struc- 
ture. Contact residue sites are arbitrarily defined as residue pairs whose minimum atomic distances are 
shorter than 5A, and which are separated by 6 or more residues along a peptide chain. The upper table 
shows results for Pearson's correlation coefficients and the lower table does those for partial correlation 
coefficients. Significantly-positive correlation coefficients are found for almost all site pairs. In the phylo- 
genetic trees of these protein families branch lengths are completely heterogeneous. The expected value 
of the probability of amino acid substitution in a branch is an increasing function of branch length; 
A| b ss (1 — exp(— fiitb)) where pt is an amino acid substitution rate for site i. Thus, Pearson's correla- 
tion coefficients of the expected values of substitution probability over branches between sites should be 
significantly positive, as shown in Table [2l 

On the other hand, a partial correlation coefficient defined in Eq. [I5]is a correlation coefficient between 
residuals that cannot be accounted for by a linear regression on the vectors of characteristic changes along 
branches at other sites. When dependences on other sites in the variation of substitutions are removed, 
significantly positive correlations (r > r t ) are found only in a limited number of site pairs, and most 
site pairs show insignificant correlations. Furthermore, site pairs in the category of significantly-positive 
correlation tend to be contact residue site pairs with significantly- high probabilities; see the column of 
positive predictive value, PPV = TP/(TP + FP), where TP and FP are the numbers of true and false 
contact residue pairs, respectively. This result clearly indicates that the partial correlation coefficients 
represent the strength of the direct dependences of substitutions between sites. 

Co-evolution scores for site pairs 

Concurrent substitutions between sites require that the direct correlation of substitutions must be pos- 
itive. Therefore, only positive values of the partial correlation coefficients (C?-) arc used to define a 
co-evolution score (pfj) based on concurrent substitutions. 

p\ 3 = max(q,, 0) (1) 

For all other characteristic variables employed to detect co-evolving site pairs, the condition of concurrent 
substitutions between sites are a premise. Thus, instead of using partial correlations of characteristic vari- 
ables themselves, the geometric mean of the partial correlation coefficient of each characteristic variable 
and the co-evolution score based on concurrent substitutions is used as a co-evolution score based on each 
characteristic change. 

p% = sgnC* Mf^Y' 2 for xG{v,c,hb,h,...} (2) 

As already mentioned in the Method section, negative correlations are required for characteristic 
variables such as volume, charge, and hydrogen bonding capacity to reflect compensatory substitutions. 
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In Tabic El TP and FP for each category of significantly positive (p?- > r t ) and negative (pfj < —rt) 
correlations under the condition of \pfA > of, > r t are listed for each characteristic variable. In the 
cases of volume, charge, and hydrogen bonding capacity, PPV for contact residue pairs is clearly larger 
in the category of significantly negative correlation than significantly positive correlation, indicating 
that these quantities to detect compensatory substitutions between sites are good predictors for close 
spatial proximity. Besides, there are more site pairs with significantly negative correlations than with 
significantly positive correlations, clearly indicating the presence of structural constraints against these 
physico-chemical changes. 

To improve contact prediction by using characteristic variables p x together with the characteristic 
variable p s of concurrent substitutions, the PPV for the category of significantly positive or negative 
correlations should be larger than the PPV for concurrent substitutions. Both categories of significantly 
positive and negative correlations show better PPVs in the characteristic variables of hydrophobicity, 
j3 and turn propensities, aromatic interaction and branched side-chain. In the characteristic variables 
of cross-link capability and ionic side-chain, only the category of significantly positive correlation shows 
better PPV than the category of significantly positive correlation for concurrent substitutions. The a 
propensity is not effective to detect contact residue pairs, although it may be effective to detect residue 
pairs within a helix or within helices. Based on these results, an overall co-evolution score for site pair 
is defined here as 

Pij = maxfpfj, max(— p" J5 0),max(— p\j, 0),max(— p^ b ,0), 

141, \4\> 14-1' K\< l4l,niax(p^,0),max(p-,0)] (3) 

Contact prediction based on the overall co-evolution score 

Co-evolving sites pairs are selected for contacts in the decreasing order of the overall co-cvolution score 
Pij. Although this score for co-evolution appears to be able to predict contact site pairs, preliminary 
results of contact prediction indicate that both terminal sites in multiple sequence alignments often have 
large values of pfj- (x ^ s) for any other site, and also that there are a few sites showing extremely large 
values for y\ H(pij —r t ); the H denotes the Heaviside step function. Such an anomalous feature probably 
indicates a poor quality at these sites in multiple sequence alignments. Thus, in contact prediction, 

1. the co-evolution scores of pfj (x ^ s) are ignored for both terminal sites in multiple sequence 
alignments; that is, pij = pfj. 

2. Also, if Y]j H(pij — r t ) > 15, p^ = p\j will be used for site i, and 

3. if H(f ij ~ r t) > 15, p^ = will be used and such a site will be excluded in contact prediction. 

The threshold value rt used here is the value of correlation coefficient corresponding to E- value = 0.0001 
in the Student's t-distribution. In the present contact prediction, only one site that is the N-terminal 
site in the multiple sequence alignment for the KH_1 was excluded as an anomalous site. 

Needless to say, the norm of any characteristic change vector is almost zero for invariant sites; || A.;j ~ 
0. Therefore, invariant sites are excluded in the present method for contact prediction. 

Accuracy of contact site pairs predicted on the basis of the overall co-evolution 
score 

Accuracies of predictions based on the overall co-evolution score and on the direct information (DI) 
score [55] calculated by a maximum entropy model are compared by using three measures in Table 3] 
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for protein families listed in Table [TJ Those three measures are PPV, mean Euclidean distance from 
predicted site pairs to the nearest true contact (MDPNT) in the 2-dimcnsional sequence-position space, 
and the mean Euclidean distance from every true contact to the nearest predicted site pair (MDTNP). 
The MDPNT and MDTNP, which were defined and used in [Hj , are qualitative measures of false positives 
and of the spread of predicted site pairs over true contacts, respectively. Smaller values of these measures 
indicate better predictions. For the predictions based on the DI score, the filters based on a secondary 
structure prediction and on disulfide bonds are not applied to the DI scores but only the filter |26j based 
on the degree of residue conservation at each site is, in order to compare the prediction accuracies of 
co-evolution and DI scores themselves. 

The reliability of predicted co-evolving site pairs decreases with decreasing value of co-evolution score, 
and co-evolving site pairs are selected in the decreasing order of co-evolution score, therefore prediction 
accuracy tends to decrease as the total number of predicted sites pairs increases; see Fig. [2] In Table 0] 
the results of predictions in which the numbers of predicted contacts are equal to one fourth or one third 
of the number of true contacts are listed for each protein family. One third of the total number of true 
contacts is equal to the sequence length in the case of Trypsin, which has the largest number of contacts 
per residue, and equal to half of the sequence length in the case of Trans_reg_c and 7tm_l, which have 
the smallest number of contacts per residue. This ratio was chosen, because Marks et al. [26] reported 
that one needs about 0.5 to 0.75 predicted distance constraints per residue or about 0.25 to 0.35 of the 
total number of contacts, to achieve reasonable three-dimensional structure prediction. 

In Fig. [3] and Fig. IS H co-evolving site pairs are shown in the lower half of each triangular map 
in comparison with residue pairs whose minimum atomic distances are less than 5 A. For comparison, 
contact residue pairs predicted with high DI scores [25] ar e also shown in the upper half of each triangular 
map. Gray filled-squarcs, red and indigo fillcd-circlcs indicate such residue-residue proximities, true and 
false positives in contact prediction, respectively. It should be noted here that residue pairs separated 
by 5 or fewer positions (\i — j\ < 5) in a sequence may be shown with the gray filled-squares but are 
excluded in the prediction of co-evolving site pairs and in the contact prediction with the DI score. The 
total number of predicted site pairs is equal to one third of the total number of true contacts in each 
protein structure. 

In Table [4] which method is better in the accuracy of contact prediction is indicated by a bold 
font. The PPVs of the present method are comparable to or better than the DI method except CH, 
Kunitz_BPTI, Lectin_C, and RNase_H. In Fig. [2] the PPVs of the present method and the DI method 
are drawn by solid and dotted lines as a function of the ratio of predicted to true contacts, respectively. 
Also, the values of MDPNT and of MDTNP are compared between the present and DI methods and 
also between the protein families in Figs. IS2I and [S3l respectively. The good performance of the present 
method is shown over a wide range of predicted site pairs. 

Dependence of the prediction accuracy on the number of predicted site pairs 

The dependences of the accuracy of predicted contacts on their number are shown in Fig. [5] for PPV, 
in Fig. [S2] for MDPNT, and in Fig. [H for MDTNP. The total number of predicted site pairs takes 
every 10 from 10 to a sequence length; also accuracies for the numbers of predicted contacts equal to one 
third or one fourth of true contacts are plotted. Here, in order to compare prediction accuracies between 
protein families, the total number of predicted contacts is shown in the scale of the ratio of predicted to 
true contacts. It is clearly shown that there is an overall trend for PPV to decrease monotonically as 
increasing number of predicted site pairs. However, exceptional increases of PPV are also observed with 
increasing number of site pairs predicted. In the protein family of CH, PPV changes from 0.43 to 0.5 as 
the number of predicted site pairs increases from 30 to 50. Because except the case of CH such abnormal 
increases of PPV often occur in the range of small numbers of predicted site pairs, i.e., from 10 to 30, 
they may be caused by statistical fluctuations. 
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It is shown in Table 2] and Fig. [S2] that the relationships of MDPNT with the ratio of predicted 
to true contacts are almost inverse of that of PPV, indicating that the MDPNT and PPV are two 
different measures of the quality of predicted site pairs but result in similar evaluations. On the other 
hand, MDTNP, which measures the spread of predicted site pairs over true contacts, differently measures 
the qualities of predicted contacts from PPV and MDPNT. It tends to decrease monotonically as the 
increasing number of predicted site pairs irrespective of the quality of prediction accuracy, and therefore 
it is not appropriate to measure the dependence of prediction accuracy on the total number of predicted 
site pairs. 

Dependence of the prediction accuracy on protein fold types 

As expected, prediction accuracy is different between proteins. However, it is unexpected that prediction 
accuracy may be slightly lower for a proteins, at least for the present three proteins, than for /3 proteins; 
see Fig. [5] Especially the prediction accuracy for the membrane protein 7tm_l is remarkably lower than 
other two a proteins. This feature is observed in both the present and DI methods. Thus, this feature 
may originate in differences between structural constraints in a-a packing and in packings of (3 strands 
and of (3 sheets, although the low prediction accuracy for the membrane protein 7tm_l would result 
from a-a packing peculiar to membrane proteins. Here it should be noted that the a proteins have less 
contacts per residue than the j3 proteins; see Table |H A definitive answer must be postponed until more 
a proteins are analyzed. 

Dependence of the prediction accuracy on the diversity and the number of 
sequences used 

Multiple subsets of a full alignment are generated by using different values of threshold Tu for branch 
length to remove OTUs connected to their parent nodes with short branches. In Fig. HJ Fig. IS4| and 
Fig. IS51 the PPVs, MDPNTs, and MDTNPs calculated from each data are plotted against the number of 
sequences used, respectively. Because the threshold values used to generate each datasct should also affect 
the accuracy of prediction, they arc written near each data point. A general tendency is of course that 
the PPV and MDPNT are improved by using more sequences. However, the number of sequences and the 
threshold Tu where accuracy improvement is saturated are very different between protein families. For 
example, in the case of SH3_1, no significant improvement in the PPV and MDPNT is observed in a wide 
range of 0.2 > Tu > 0.001, even if the number of sequences increases from 1500 to 4000. In RNase_H, 
the PPV and MDPNT are almost constant in the range of 0.05 > T bt > 0.001 and 2120 < n otu < 7048. 
In Response_reg, after the PPV reaches the highest value 0.73 at Tu = 0.6 and n tu = 3344, it even 
decreases to 0.69 in 3344 < n tu < 7613, although its decrement is not large and the MDPNT is almost 
constant in this region. Multiple sequence alignments may include many sites where significant fraction of 
sequences have deletions, reducing effectively the number of sequences; for example, RNascJT However, 
it may be worth increasing the number of sequences until Tu ~ 0.01. Here calculations have been carried 
out until n tu ~ 7000 or Tbt ~ 0.01. A few thousand sequences are necessary to get a reliable prediction 
for proteins consisting of a few hundred residues. 

Some data points in Fig. |H Fig. IS4| and Fig. [S5] correspond to datasets generated by using the same 
value of threshold but by removing different OTUs. PPV often differs between such datasets, although 
the difference of PPV ranges from a few percent to 8 percent; see the PPVs for Tbt = 0.2 of Trans_ref_C, 
Tbt = 0.02 of CH, and Tu = 0.5 of Cadherin in Fig. [4] This fact indicates that the distribution of 
sequences in a sequence space significantly affects prediction accuracy. Also, it is indicated that some 
site pairs predicted are still based on rare events of concurrent substitutions in a tree. 
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Discussion 

Significance of compensatory substitutions in protein evolution 

It has been shown that site pairs giving the significant values of partial correlation coefficients for substitu- 
tions, which concurrently occurred in branches of a phylogcnctic tree and would be mostly compensatory 
substitutions, well correspond to contact site pairs in protein 3D structures. In compensatory substitu- 
tions, the fitness of first mutations must be negative, and successive mutations must occur to compensate 
the negative effect of the first mutation. A time scale in which compensatory mutations successively 
occur is much shorter than the time scale of protein evolution that is the order of fixation time for 
neutral mutations, otherwise negative mutants will be eliminated from a gene pool by selection. Thus, 
negative substitutions and their compensatory substitutions are expected to be observed as concurrent 
substitutions in the same branch of a tree. If substitutions are completely neutral, there will be no 
correlation in time when substitutions occur. Thus, a fact that contact site pairs can be well predicted 
by the present method indicates that compensatory substitutions are significant in protein evolution. 
Significance of compensatory substitutions was also indicated by a fact that likelihoods of phylogenetic 
trees can be significantly improved by taking account of codon substitutions with multiple nucleotide 
changes [251150]. 

Limitations in prediction accuracy 

Prediction accuracy of contact residue pairs is different between the protein families. Possible reasons for 
false positives include (1) statistical noise due to an insufficient number of sequences, insufficient diversity 
of sequences, and incorrect matches in a multiple sequence alignment and an incorrect phylogenetic tree, 
(2) structural and functional constraints from other residues, which are not taken into account in the 
calculation of partial correlation coefficients from a correlation matrix, within a protein or even through 
other molecules involved in a molecular complex such as oligomcrization, protein-substrate, and protcin- 
DNA, (3) structural variance in homologous proteins, although each Pfam family is assumed to be 
iso-structural. 

In reducing statistical noise, the diversity of sequences in protein families is more effective than the 
number of sequences itself. Also, the presence of many deletions in sequences reduces the value of 
including those sequences. The present subset (Tf, t = 0.01) of the full alignment of RNase_H family 
consists of more than 4700 sequences, but their multiple sequence alignment includes many sites where 
the significant fraction of sequences have deletions. Here, branch lengths of sequences from their parent 
nodes in a phylogenetic tree are used to get less sequences but as diverse sequences as possible. Sequences 
whose branch lengths from their parent nodes are longer than 0.01 amino acid substitutions per site in 
a NJ tree are needed to be more than a few thousand, in order to get useful numbers (> 35%) of PPV. 
This requirement seems to be similar to that for the maximum entropy model [26] , in which the order of 
one thousand sequences are required to reduce statistical noises including phylogenetic bias in frequency 
counts. 

Dependence on the accuracies of sequence alignment and phylogenetic tree 

We do not extensively examine the dependence of contact prediction on the accuracy of phylogcnctic 
tree. We tried to optimize a phylogenetic tree as well as model parameters for Kunitz_BPTI, Lectin_C, 
RNase_H, and Rcsponse_reg before evaluating characteristic variables at each site in each branch. The 
prediction accuracy of contact site pairs were not significantly improved. The improvement of prediction 
accuracy by a tree optimization from a NJ tree was within a few percent. The optimization of tree may 
be not cost-effective, because it requires a intensive calculation especially for thousands of sequences. An 
accurate multiple sequence alignment would be more critical to increase prediction accuracy. 
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In the present calculations, sites that arc deletions in a target protein structure are excluded in the 
optimization of tree branches and in the calculation of partial correlation coefficients. The calculation 
of partial correlation coefficients by including those sites has been attempted for the Kunitz_BPTI and 
RNasc_H domain families. No improvement was obtained at least for those protein families. 

A difference from methods of extracting residue type dependencies between 
sites 

So far remarkable improvements in the accuracy of contact prediction were all achieved by extracting 
essential correlations of amino acid type between residue positions from multiple sequence alignments 
P51I251 - I2T] . Here, almost comparable accuracy of contact prediction has been achieved by evaluating 
direct correlations of concurrent and compensatory substitutions between sites. The present method 
cannot be applied to the cases in which all substitutions are nearly neutral. In such a case, structural and 
functional constraints from closely-located sites in protein 3D structures are reflected only in the joint 
distribution of amino acid types between the sites. 

Residue-residue interactions maintaining secondary structures appear to be more easily detected by 
the joint distribution of amino acid types between the sites than concurrent substitutions. In general, 
the present method detects less dependences between neighboring sites along a sequence than the other. 
Marks et al. [55] reported that residue pairs separated by four or five positions in a sequence often 
have high DI scores without being in close physical proximity in the folded protein. Even for site pairs 
separated by more than five positions, their method based on the joint distribution of amino acid types 
detected more dependences in a helical regions than the present method; see 7tm_l in Fig. |3| 

From a such viewpoint, methods of extracting direct correlations of amino acid type between sites may 
be better for extracting direct dependences between sites than those of detecting compensatory substi- 
tutions in a tree. However, interactions between closely-located sites do not necessarily result in distinct 
correlations of amino acid type between the sites. For example, hydrophobic interactions are relatively 
non-specific, but significantly contribute to residue-residue interactions inside protein structures. The 
types of amino acids found inside protein structures are mainly hydrophobic. In the case of membrane 
proteins, most of amino acids embedded in membrane are hydrophobic. In the case that residue-residue 
interactions do not result in distinct correlations of amino acid type between sites, methods of detect- 
ing compensatory substitutions should perform better than methods of extracting direct correlations of 
amino acid type between sites. Membrane proteins may be this case; see 7tm_l in Fig. [2j Structural 
analyses of membrane proteins, especially the determinations of protein coordinates in transmembrane 
regions, are difficult in comparison of globular proteins. The present method for contact prediction could 
facilitate the coordinates determinations of membrane proteins. 

Besides, the observed joint distributions of amino acid types include more or less phylogenetic bias as 
a statistical noise, but there is no such a noise in the present method. Thus, the both types of methods 
are complementary to each other. 

A difference between Bayesian graphical models and the present model 

A Bayesian graphical model was applied to disentangling direct from indirect dependencies between 
residue positions in multiple sequence alignments of proteins [23j . In the Bayesian graphical model, 
an acyclic directed graph is assumed for site dependence, although interactions between sites in protein 
structures act on each other. A causal relationship between substitutions is of course directional. However, 
substitutions at a site affect on closely-located sites, and also the site is affected by substitutions at those 
surrounding sites. Thus, dependence between sites should be assumed to be bidirectional or undirectional. 
Unlike Bayesian graphical models, an undirected graph is assumed in a Gaussian graphical model [32], in 
which a null edge between two nodes encodes that random variables assigned to the nodes are conditionally 
independent of each other given the values of random variables assigned to other nodes. Assuming that 
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a joint probability density distribution of random variables is a multivariate Gaussian distribution, two 
random variables are conditionally independent given the values of other random variables if and only 
if a partial correlation coefficient between the two random variables is equal to zero. Thus, the present 
model based on partial correlation coefficients can be regarded as a Gaussian graphical model in which 
an undirected graph is assumed for dependences between sites and a feature vector Aj is assigned to 
node i as the observed values of a random variable. This is one of essential differences between the 
present model and the Bayesian models [22H24] . although there is another essential difference that the 
joint distributions of residues at sites were analyzed in [23l[24] . 



Methods 



Mean of characteristic changes accompanied by substitutions at each site in 
each branch of a phylogenetic tree in a maximum likelihood model 

Assuming that substitutions occur independently at each site, a likelihood P(A\T, 9) of a sequence 
alignment A in a phylogenetic tree T under a evolutionary model is represented as a product over sites 
of the likelihood of a sequence alignment Ai for site i. 

P{A\T,e) = Y[P(Ai\T,Q) (4) 

i 

P{Ai\T,G) = ^P(A|T,9,0 Q )P(0 Q ) (5) 

where the distribution of 9 a for the variation of selective constraint |29U30| are assumed to be a priori 
equal to P(9 a ), The evolutionary model corresponds to a substitution model of amino acid or codon. 
Then, if substitutions are assumed to be in the equilibrium state of a time-reversible Markov process, the 
likelihood of a sequence alignment Ai for site i will be calculated by taking any node as a root node. Let 
us assume here that the root node is a left node (v^l) of a branch b. 

p(Ai\T,e,e a ) = ^^p(^|u 6i = / t ,^R = A,T,e,^) (6) 

f%ohtifoyfeep k, v bR = A, T, 0, 9 a ) = (7) 
P b L(Ai\v bL = k, T, 0, 6 a )f K P{X\K, t b , 0, e a )P bR (Ai\v bR = A, T, 9, a ) (8) 

where k and A indicate the type of codon or amino acid depending on the evolutionary model, and f K is 
the equilibrium frequency of k. P(A|k, t&, O, 9 a ) is a substitution probability from n to A at the branch b 
whose length is equal to tb- PbL{Ai\i'bL = k, T, 0, 9 a ) is a conditional likelihood of the left subtree with 
v b L = K [33j . In the maximum likelihood (ML) method for phylogenetic trees, the tree T and parameters 
are estimated by maximizing the likelihood. 

(f,6) = argmaxP(^|T,e) (9) 

In this model, the mean A;/, of a quantity A K \ accompanied by substitutions from k to A at each site i 
in each branch b can be calculated as follows. 



A K .xP{Ai\v bL = K,v bR = A, T, Q,9 a ) 



A lb (A t ,T,Q) = YA ib (Ai,f,e,6 a )P(0 a \Ai,T,e) (11) 
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where P(9 a \Ai,T,Q) is a posterior probability calculated from 

If A K \ is defined to be equal to 1 for k ^ A and for k = A, A^^ ) (-4^,^ , , 6) will represent the expected 
value of substitution probability at site i in branch b. Let us define a vector A^ as follows, and consider 
the correlation of the two vectors, A^ and Aj. 

A, ee (... , A ib (A i ,f,e)- ^ bAib ( A \' T ' e) ,...)' (13) 

where / denotes the transpose of a matrix. A correlation matrix C is defined to be a matrix whose 
element is the correlation coefficient r^Aj between Ai and Aj. 

^ = (14) 

where (Ai, Aj) denotes the inner product of the two vectors. The correlation between sites i and j may 
be an indirect correlation resulting from correlations between sites i and k and between sites k and j. To 
remove such indirect correlations, partial correlation coefficients are used here. The partial correlation 
coefficient is a correlation coefficient between residual vectors {Tl±{A klii } Aj an( i n_L{A Mi ■} Aj) of given 
two vectors that are perpendicular to a subspace consisting of other vectors except those two vectors (A; 
and Aj) and therefore cannot be accounted for by a linear regression on other vectors; Tlj_^ k ^. .j is a 
projection operator to a space perpendicular to the subspace. If the correlation matrix is regular, then 
the partial correlation coefficients Cij will be related to the element of its inverse matrix. 

n = r _ (n± { A M ,,}A 8 ; , n ±{AMi j} A 3 ) _ (g~l)g 



Characteristic variables indicating co-evolution between sites 

The following characteristic changes accompanied by substitutions whose correlations indicate co-evolution 
between sites have been used here. 

1. Occurrence of amino acid substitution. 

The most primary quantity is one (A s ) that is defined as follows and indicates the occurrence of 
amino acid substitution at a site. 

K,x = 1- <*-„,-* (16) 

where 5 an ,ax IS the Kronecker's S that takes 1 if a K = a\ and otherwise. The a K is the type 
of amino acid corresponding to k. A? b (Ai,T,Q) in Eq. [TT] indicates the expected value of the 
probability of amino acid substitution at site i in branch b. This quantity was also used for 
the prediction of contact residue pairs in protein structures. 

2. Volume change of a side chain accompanied by an amino acid substitution. 

Protein structures must be tightly packed |34) , and therefore mutations between amino acids whose 
side chain volumes significantly differ tend to unstabilize protein structures and therefore will be 
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eliminated from a gene pool by selection [35] unless the volume change is compensated by successive 
mutations at sites closely located in protein structures. Thus, the volume changes of side chains 
caused by amino acid substitutions are used to detect co-evolution between closely located sites in 
protein structures. 

A = side_chain_volume aA — side_chain_volume afs (17) 

where sidc_chain_volume aA means the volume of side chain a\ . The amino acid volumes used here 
are the mean volume occupied by each type of amino acid in protein structures, and taken from 
the set named BL+ in Table 6 of [35] j the volume of a half cystine (labeled as " cys" in the table) 
is used here for a cysteine. 

3. Charge change of a side chain accompanied by an amino acid substitution. 

Charge-charge interactions in protein structures are known to be significant. Substitutions that 
keep favorable charge-charge interactions are expected to be advantageous in selection. 

A£ A = side_chain_charge aA — side_chain_charge aK (18) 

where side_chain_charge a represents a charge of side chain type a K and takes 1 for positively charged 
side chains (arg and lys), 0.1 for his, and —1 for negatively charged ones (asp, glu). 

4. Change of hydrogen-bonding capability accompanied by an amino acid substitution. 

One of the most important interactions to stabilize protein structures is a hydrogen-bonding inter- 
action. Substitutions that keep hydrogen-bonds are favorable. In order to detect whether hydrogen- 
bonds between side chains can be kept despite substitutions, the change of hydrogen-bonding ca- 
pability is defined here as 

Ajj b A = acceptor _capability QA — acceptor .capability^ + /nonumber (19) 
donor_capability aA — donor_capability aK (20) 

where acceptor_capability aK takes —1 if a side chain a K can be an hydrogen-bonding acceptor and 
otherwise. Donor_capability a takes 1 if a side chain a\ can be a hydrogen-bonding donor and 
otherwise. Hydrogen-bonding acceptors are asn, asp, gin, glu, his, ser, thr, and tyr. Hydrogen- 
bonding donors are arg, asn, gin, his, lys, ser, thr, trp, and tyr. A negative correlation is expected 
for this quantity between closely located sites in a protein 3D structure. 

5. Change of hydrophobicity accompanied by an amino acid substitution 

Also, hydrophobic interactions are crucial for a polypeptide chain to be folded into a unique three- 
dimensional structure. Hydrophobic interactions may be correlated between substitutions at nearby 
sites in a protein 3D structure. 

^k,A = e a x r — e a „r (21) 

where e ar . r is the mean contact energy of an amino acid a K with surrounding residues (r) in protein 
structures; see [37] for its exact definition. 

6. Changes of /3 and turn propensities accompanied by an amino acid substitution. 
Changes of /3 and turn propensities [38j are also examined. 

A^ A = /3_sheet_propensity a — /3_sheet_propensity ajt 
Aj. a = turn_propensity aA — turn_propensity aK 

where /3_sheet_propensity a is the value of (3 sheet propensity [35] of amino acid a K 
of a propensity is also examined but it is not used as a co-evolution score. 



(22) 
(23) 

. The change 
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7. Change of the capability of aromatic interaction accompanied by an amino acid substitution. 



^ K \ — ^aromatic_side_chains,a^ - 5 aromatic ^ idc . chains , aK (24) 

where 5 ar omatic_amino_acids,a K i s equal to 1 if a K is one of aromatic side-chains (his, phe, trp, and tyr) 
and otherwise. 

Change of branched side-chain accompanied by an amino acid substitution. 



^k,X — ^aliphatic_branched_side_chains,aA ^aliphatic_branched_side_chains,a K 

(25) 

where £branched_side-chains,a K is equal to 1 if a K is one of aliphatic branched side-chains (ile, leu and 
val), and otherwise. 

9. Change of cross-link capability accompanied by an amino acid substitution. 



^k.A — ^cross_liiik,a^ ^cross_link,a K (^^) 

where <5 C rossJmk,a K is equal to 1 if a K is one of asn, gin, ser and thr, and otherwise. 
10. Change of ionic side-chain accompanied by an amino acid substitution. 

^k,A = ^inonic_sidc_chains,a> 1 ^inonic_sidc_chains,a K 

(27) 

where (Sionic-side.chains,^ is equal to 1 if a K is one of inonic side-chains (asp, glu, arg, and lys), 0.1 if 
a K is his, and otherwise. 

A mechanistic codon substitution model for the maximum likelihood inference 
of phylogenetic tree 

A mechanistic codon substitution model, in which each codon substitution rate is proportional to the 
product of a codon mutation rate and the average fixation probability depending on the type of amino 
acid replacement, has advantages over nucleotide, amino acid, and empirical codon substitution 

models in evolutionary analysis of protein-coding sequences, because mutation at the nucleotide level and 
selection at the amino acid level can be separately evaluated. Even for amino acid sequences of OTUs 
(operational taxonomic units), the mechanistic codon substitution model with the prior assumption of 
equal codon usage for them yields smaller AIC values (Akaike Information Criterion) than any amino 
acid substitution model does (unpublished). Thus, the mechanistic codon substitution model [30] is used 
here to evaluate the likelihood of a phylogenetic tree and the posterior means of characteristic variables 
at each site in each branch. 

In the mechanistic codon substitution model, in which substitutions arc assumed to be in the station- 
ary state of a time-homogeneous reversible Markov process, the substitution probability matrix in time 
t is represented as exp Rt with a substitution rate matrix R, which is defined as 



R M u = C oast M^J^e w »» ioxn + v (28) 

Jy 
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where M M!/ is the mutation rate from codon /i to i>, /™ ut is the equilibrium frequency of codon v in 
nucleotide mutations, /„ is the equilibrium codon frequency, ^" ut e w ^ is the average rate of fixation, 

and Wfjtu is the selective constraints for mutations from \i to v; refer to |30j for details. Assuming that 
nucleotide mutations occur independently at each codon position but multiple nucleotide mutations in 
a codon can instantaneously occur, the mutation rate matrix M is approximated with 9 parameters; 
the ratios of nucleotide mutation rates, m tc \ ag / m [tc][ag] , m ag /m tc \ ag , m ta /m [tc][ag] , m tg /m [tc][ag] , and 
m ca/ m itc][ag}i the relative ratio m of multiple nucleotide changes, and the equilibrium nucleotide frequen- 
cies in nucleotide mutations, /™ ut , /" lut , and /™ ut . The selective constraint w^ v for a protein family 
is approximated with a linear function of the mean selective constraints that were evaluated [23] by 
ML-fitting a substitution matrix based on the mechanistic codon model to an empirical amino acid sub- 
stitution matrix. Here we use the mean selective constraints iv~ derived from the empirical amino acid 
substitution matrix LG [39]. The slope f3 and a constant term wq are parameters; = flwff + wo- 
The selective constraint w^ v is assumed to vary across sites and the variation of selective constraints 
[30] has been approximated by a discrete gamma distribution [31] with 4 categories. Thus, one more 
parameter is a shape parameter a for the discrete gamma distribution. In the result, 12 parameters in 
addition to the equilibrium frequencies of codons must be determined in this model. See [30] for full 
details of these parameters. 

The equilibrium frequencies of codons are estimated to be equal to codon frequencies in sequences of 
OTUs with the assumption of equal codon usage for amino acid sequences. Other 12 parameters were 
estimated by maximizing the likelihood of a NJ tree of Pfam seed sequences. Then, the ML estimates of 
the parameters obtained from the Pfam seed sequences are used to evaluate branch lengths and posterior 
means of characteristic variables at each site in each branch of NJ trees for the subsets of Pfam full 
alignments. NJ trees taken from the Pfam were used for the tree topologies, because optimizing tree 
topologies for more than a few thousands of sequences require too much computational time. Branch 
optimization of phylogenetic trees and posterior means of characteristic variables arc calculated using 
Phyml [40j modified for the mechanistic codon substitution model. 

Definition of contact residue pairs in protein structures 

Contact residue pairs are arbitrarily defined here as residue pairs whose minimum atomic distances are 
shorter than 5 A and which are separated by 6 or more residues along a peptide chain. This definition, 
especially the latter condition, which was used in Marks et al. |26j is employed here only for the comparison 
of the present predictions with their predictions of contact residue pairs. 

The PDB ID of a protein structure used for a target protein in each Pfam family is listed in Table [T] 
The amino acid sequences of these PDB entries arc just the same as those of the Uniprot IDs, which are 
also listed in Table [TJ 
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Figure Legends 



Topology: by the Neighbor joining method 

Branch lengths: by a ML method in a mechanistic codon substitution mode] 
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Co-evolution score based on partial correlation coefficients: 

Pij = max[py, max(— p\p 0), max(— pfj, 0), max(— pq, 0), ...] 
Pl = max {q p 0),p% = sgn Cfj (]^C§])V (z 6 {„, c , ftfe, ft, . . .}) 



Figure 1. Framework of the present model. See text for details. 
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Figure 2. Dependence of PPV on the number of predicted contacts. The dependences of the 
positive predictive values on the total number of predicted contacts are shown for each protein fold of 
a, /?, a + /3, and a//3. The solid and dotted lines show the PPVs of the present method and the method 
based on the DI score |26j , respectively. The total number of predicted site pairs is shown in the scale of 
the ratio of the number of predicted site pairs to the number of true contacts. The total number of 
predicted site pairs takes every 10 from 10 to a sequence length; also PPVs for the numbers of predicted 
site pairs equal to one fourth or one third of true contacts are plotted. The filled marks indicate the 
points corresponding to the number of predicted site pairs equal to one third of the number of true 
contacts. The number of sequences used here for each protein family is one listed in Table [TJ 
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Figure 3. Co-evolving site pairs versus DI residue pairs. Residue pairs whose minimum atomic 
distances are shorter than 5 A in a protein structure and co-evolving site pairs predicted are shown by 
gray filled-squares and by red or indigo filled-circles in the lower-left half of each figure, respectively. 
For comparison, such residue-residue proximities and predicted contact residue pairs with high DI 
scores in [26| are also shown by gray filled-squares and by red or indigo filled-circles in the upper-right 
half of each figure, respectively; only the conservation filter is applied but the filters based on a 
secondary structure prediction and for cysteine pairs are not applied to the DI scores. Red and indigo 
filled-circles correspond to true and false contact residue pairs, respectively. Residue pairs separated by 
five or fewer positions in a sequence may be shown with the gray filled-squares but are excluded in both 
the predictions. The total numbers of co-cvolving site pairs and DI residue pairs plotted for each 
protein are both equal to one third of true contacts (TP + FP = #contacts/3). The PPVs of both the 
methods for each protein arc listed in Table H 
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Figure 4. Dependence of PPV on the number of sequences used. The positive predictive 
values are plotted against the total number of homologous sequences used for each prediction. The filled 
marks indicate the points corresponding to the number of used sequences listed for each protein family 
in Table [T] The values written near each data point indicate the threshold value Tu\ OTUs connected 
to their parent nodes with branches shorter than this threshold value are removed in the NJ tree of the 
Pfam full sequences used for each prediction. Some data points correspond to datasets generated by 
using the same value of the threshold but by removing different OTUs. 
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Tables 

Table 1. Protein families used. 



Pfam ID a 


Seed" 


Full c 


Target protein domain 


Fold type 


No. sites/Length^ 








Uniprot ID d 


PDB ID e 






Trans_reg_C 


362 


35180 


OMPR.ECOLI/156-232 


10DD-A:156-232 


a 


76/77 


CH 


202 


5756 


SPTB2_HUMAN/ 176-278 


1BKR-A:5-107 


a 


101/103 


7tm_l 


64 


26656 


OPSD.BOVIN/54-306 


lGZM-A:54-306 


a (tm) 9 


248/253 


SH3.1 


61 


8993 


YES.HUMAN/97-144 


2HDA-A:97-144 


P 


48/48 


Cadhcrin 


57 


18808 


CADH1.HUMAN/267-366 


2072-A:113-212 


P 


91/100 


Trypsin 


71 


14720 


TRY2_RAT/24-239 


3TGI-E: 16-238 


P 


212/216 


Kunitz_BPTI 


151 


3090 


BPTl_BOVIN/39-91 


5PTI-A:4-56 


a + f3 


53/53 


KH_1 


399 


11484 


PCBP1.HUMAN/281-343 


lWVN-A:7-69 


a + /3 


57/63 


RRM.1 


79 


31837 


EL AV4_HUMAN /48- 1 18 


1G2E-A:41-111 


a + (3 


70/71 


FKBP_C 


174 


11034 


045418.CAEEL/26-118 


1R9H-A:26-118 


a + (3 


92/93 


Lectin_C 


44 


6530 


CD209.HUMAN/273-379 
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61 
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a Pfam release 26.0 (November 2011) was used. 

The number of sequences included in the seed alignment of the Pfam. 
c The number of sequences included in the full alignment of the Pfam. 
d Target protein domain in the Pfam family. 

e A protein structure corresponding to the target protein domain. 

' Unreliable site positions that are represented by the lower case of characters in alignments were excluded in the 
evaluation of prediction accuracy. 
9 Transmembrane a. 
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Table 2. Correlation (rAfA s ) versus partial correlation (rrr^ a s 77^ As ) coefficients of 
concurrent substitutions between sites. 
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1.8 
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0.56 
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1.8 
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2.2 
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2.4 


36:13 
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160:2401 
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1.9 


53:61 


0.46 


109:2677 
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202/110 


1.8 


72:87 


0.45 


101:3182 


0.03 
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1:10 


0.09 


RNase_H 


271/127 


2.1 


37:56 


0.40 


161:3700 


0.04 


72:3387 


0.02 


1:14 


0.07 


Ras 


329/158 


2.1 


81:55 


0.60 


203:6472 


0.03 


44:4768 


0.01 


1:9 


0.10 



a OTUs connected to their parent nodes with branches shorter than the threshold value T(, t are removed from 
each Pfam full alignment, and the number of remaining OTUs is listed. 

b The r t is a threshold for a correlation coefficient corresponding to the E-value Et = 0.001 (the P-value Pt = 
_Bt/n pa i rs ) in the Student's t-distribution of the degree of freedom, df = (2n tu~3) — 2, where n pa ; rs is the number 
of site pairs, and n tu is the number of OTUs. 

c TP and FP are the numbers of true and false positives, which are the number of contact site pairs and the 
number of non-contact site pairs in each category. Protein structures used to calculate contact residue pairs are 
listed in Table [T] Neighboring residue pairs within 5 residues (\i — j\ < 5) along a peptide chain are excluded in 
the evaluation of prediction accuracy. Also both terminal sites are excluded from counting in this table. 
d PPV stands for a positive predictive value; i.e., PPV = TP/(TP + FP). 
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Table 3. Co-evolution score (pf,) based on each characteristic variable. 



Characteristic 


Pij 




> n a 


Pij < 


-Pij 


< -n a 


variable 


T pb 




PPV C 


TP 


FP 


PPV 






over all protein families 




Substitution 


687 


642 


0.52 








Volume 


18 


20 


0.47 


73 


10 


0.88 d 


Charge 


6 


8 


0.43 


134 


54 


0.71 d 


Hydrogen bond 


4 


11 


0.27 


125 


51 


0.71 d 


Hydrophobicity 


23 


13 


0.64 d 


23 


10 


0.59 d 


a propensity 


14 


20 


0.41 


9 


10 


0.47 


/? propensity 


24 


17 


0.59 d 


30 


14 


0.68 d 


Turn propensity 


21 


18 


0.54 d 


17 


15 


0.53 d 


Aromatic interaction 


30 


10 


0.75 d 


1G 


14 


0.53 d 


Branched side-chain 


20 


16 


0.62 d 


20 


8 


0.71 d 


Cross link 


23 


12 


0.66 d 


5 


9 


0.36 


Ionic side-chain 


27 


15 


0.64 d 


14 


18 


0.44 



a See Eqs. [T] and for the definition of pfj . The r t is a threshold for a correlation coefficient corresponding to 

the E-value Et = 0.001 (the P-value Pt = St/n pa i rs ), in the Student's t-distribution of the degree of freedom, 

df = (2n tu — 3) — 2, where n pa ; rs is the number of site pairs, and n tu is the number of OTUs. 

6 TP and FP are the numbers of true and false contact residue pairs; protein structures used to calculate contact 

residue pairs are listed in Table Q] Neighboring residue pairs within 5 residues [\i — j\ < 5) along a peptide chain 

are excluded in the evaluation of prediction accuracy. Also both terminal sites are excluded from counting in this 

table. 

c PPV stands for a positive predictive value; i.e., PPV = TP/(TP + FP). 

d These PPVs are larger than the PPV for concurrent substitutions, i.e., 0.52 for p" . 
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Table 4. Accuracy of contact prediction based on the overall co-evolution score (p«). 



Pfam ID 


#contacts TP - 


h FP" 


PPV C 


MDPNT" 


MDTNP 6 




/#sites a 
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Pij 


DI 


Pij 


DI 


Pij 
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111/76 


27 
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1.30 


0.94 


4.20 


3.28 




1.5 


37 


0.432 
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1.72 


1.16 


3.64 


2.82 


CH 


172/101 


43 


0.488 


0.465 


2.23 


2.55 


4.59 


4.37 




1.7 


57 


0.439 


0.491 


2.12 


2.44 


3.70 


3.30 


7tm_l 


372/248 


93 


0.194 


0.344 


7.43 


5.31 


12.68 


7.71 




1.5 


124 


0.169 


0.306 


7.30 


5.33 


12.18 


6.40 


SH3.1 


89/48 


22 


0.636 


0.682 


0.83 


0.51 


1.69 


2.34 




1.9 


29 


0.552 


0.655 


1.15 


0.62 


1.56 


1.51 


Cadherin 


220/91 


55 


0.818 


0.836 


0.59 


0.25 


1.98 


1.98 




2.4 


73 


0.753 


0.767 


0.64 


0.45 


1.60 


1.60 


Trypsin 


636/212 


159 


0.591 


0.673 


1.75 


1.20 


3.26 


3.10 




3.0 


212 


0.533 


0.613 


2.26 


1.65 


2.83 


1.94 


Kunitz_BPTI 


111/53 


27 


0.444 


0.593 


1.40 


1.18 


2.31 


2.08 




2.1 


37 


0.541 


0.486 


1.13 


1.46 


1.86 


1.94 


KH_1 


90/57 


22 


0.500 


0.773 


0.99 


0.51 


2.41 


3.29 




1.6 


30 


0.533 


0.700 


1.07 


0.56 


2.16 


3.05 


RRM.1 


133/70 


33 


0.758 


0.818 


0.52 


0.55 


2.86 


2.36 




1.9 


44 


0.705 


0.795 


0.83 


0.49 


2.49 


1.84 


FKBP_C 


200/92 


50 


0.760 


0.840 


0.53 


0.69 


1.97 


1.85 




2.2 


66 


0.697 


0.727 


0.94 


0.85 


1.66 


1.51 


Lectin_C 


246/103 


61 


0.770 


0.705 


0.80 


0.94 


2.93 


2.67 




2.4 


82 


0.671 


0.646 


1.19 


1.17 


2.54 


2.32 


Thioredoxin 


188/99 


47 


0.532 


0.638 


0.98 


0.85 


3.43 


2.33 




1.9 


62 


0.565 


0.645 


0.94 


0.91 


3.16 


1.86 


Response_reg 


202/110 


50 


0.660 


0.680 


0.86 


0.88 


3.39 


3.06 




1.8 


67 


0.642 


0.687 


1.01 


0.92 


2.54 


2.29 


RNase_H 


273/128 


68 


0.559 


0.471 


1.51 


1.53 


3.61 


5.44 




2.1 


91 


0.549 


0.407 


1.55 


2.19 


3.27 


3.07 


Ras 


335/159 


83 


0.699 


0.699 


0.94 


1.05 


2.98 


3.68 




2.1 


111 


0.631 


0.685 


1.12 


1.45 


2.40 


2.51 



a Protein structures used to calculate contact residue pairs are listed in Table Q] Neighboring residue pairs within 

5 residues (\i — j\ < 5) along a peptide chain are not counted as contacts in the evaluation of prediction accuracy. 

6 TP and FP are the numbers of true and false positives; only predictions for TP + FP = #contacts/4 and 
#contacts/3 are listed. 

c PPV stands for a positive predictive value; i.e., PPV = TP/(TP + FP). Better values are typed in a bold font. 
d MDPNT stands for the mean Euclidean distance from predicted site pairs to the nearest true contact in the 
2-dimensional sequence-position space [26]. Better values are typed in a bold font. 

e MDTNP stands for the mean Euclidean distance from every true contact to the nearest predicted site pair in 
the 2-dimensional sequence-position space [26] . Better values are typed in a bold font. 

^ DI means the prediction based on the direct information (DI) score published in [26]; a filtering based on a 
secondary structure prediction is not applied but only a conservation filter [26] is. 
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Supporting Information 



Figure SI. Co-evolving site pairs versus DI residue pairs. Residue pairs whose minimum 
atomic distances are shorter than 5 A in a protein structure and co-evolving site pairs predicted are 
shown by gray filled-squares and by red or indigo filled-circles in the lower-left half of each figure, 
respectively. For comparison, such residue-residue proximities and predicted contact residue pairs with 
high DI scores in [26] are shown by gray filled-squares and by red or indigo filled-circles in the 
upper-right half of each figure, respectively; only the conservation filter is applied but the filters based 
on a secondary structure prediction and for cysteine pairs are not applied to the DI scores. Red and 
indigo filled-circles correspond to true and false contact residue pairs, respectively. Residue pairs 
separated by five or fewer positions in a sequence may be shown with the gray filled-squares but are 
excluded in both the predictions. The total numbers of co-evolving site pairs and DI residue pairs 
plotted for each protein are both equal to one third of true contacts (TP + FP = #contacts/3). The 
PPVs of both the methods for each protein are listed in Table |U 
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Figure S2. Dependence of MDPNT on the number of predicted contacts. The dependences 
of the mean Euclidean distance from predicted site pairs to the nearest true contact in the 2-dimensional 
sequence-position space on the total number of predicted contacts are shown for each protein fold of a, 
(3, a + (3, and a//3. The solid and dotted lines show the MDPNTs of the present method and the 
method based on the DI score [26], respectively. The total number of predicted contacts is shown in the 
scale of the ratio of the number of predicted contacts to the number of true contacts. The total number 
of predicted site pairs takes every 10 from 10 to a sequence length; also MDPNTs for the numbers of 
predicted contacts equal to one fourth or one third of true contacts are plotted. The filled marks 
indicate the points corresponding to the number of predicted site pairs equal to one third of the number 
of true contacts. The number of sequences used here for each protein family is one listed in Table [TJ 
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Figure S3. Dependence of MDTNP on the number of predicted contacts. The dependences 
of the mean Euclidean distance from every true contact to the nearest predicted site pair in the 
2-dimcnsional sequence-position space on the total number of predicted contacts are shown for each 
protein fold of a, (3, a + f3, and a/ (3. The solid and dotted lines show the MDTNPs of the present 
method and the method based on the DI score [26] , respectively. The total number of predicted site 
pairs is shown in the scale of the ratio of the number of predicted site pairs to the number of true 
contacts. The total number of predicted site pairs takes every 10 from 10 to a sequence length; also 
MDTNPs for the numbers of predicted site pairs equal to one fourth or one third of true contacts are 
plotted. The filled marks indicate the points corresponding to the number of predicted contacts equal 
to one third of the number of true contacts. The number of sequences used here for each protein family 
is one listed in Table [TJ 
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Figure S4. Dependence of MDPNT on the number of sequences used. The mean Euclidean 
distance from every predicted site pair to the nearest true contact in the 2-dimcnsional 
sequence-position space is plotted against the total number of homologous sequences used for each 
prediction. The filled marks indicate the points corresponding to the number of used sequences listed 
for each protein family in Table [TJ The values written near each data point indicate the threshold value 
Tbt] OTUs connected to their parent nodes with branches shorter than this threshold value are removed 
in the NJ tree of the Pfam full sequences used for each prediction. Some data points correspond to 
datasets generated by using the same value of the threshold but by removing different OTUs. 
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Figure S5. Dependence of MDTNP on the number of sequences used. The mean Euclidean 
distance from every true contact to the nearest predicted site pair in the 2-dimcnsional 
sequence-position space is plotted against the total number of homologous sequences used for each 
prediction. The filled marks indicate the points corresponding to the number of used sequences listed 
for each protein family in Table [TJ The values written near each data point indicate the threshold value 
Tbt] OTUs connected to their parent nodes with branches shorter than this threshold value are removed 
in the NJ tree of the Pfam full sequences used for each prediction. Some data points correspond to 
datasets generated by using the same value of the threshold but by removing different OTUs. 



